We propose a new neural network design paradigm, Reversible Column Network (RevCol). The main body of RevCol is composed of multiple copies of a subnetwork, called columns, between which multi-level reversible connections are employed. This architectural scheme gives RevCol very different behavior from conventional networks: during forward propagation, features in RevCol are gradually disentangled as they pass through each column, and their total information is maintained rather than compressed or discarded as in other networks. Our experiments suggest that CNN-style RevCol models can achieve very competitive performance on multiple computer vision tasks such as image classification, object detection and semantic segmentation, especially with large parameter budgets and large datasets. For example, after ImageNet-22K pre-training, RevCol-XL obtains 88.2% ImageNet-1K accuracy. Given more pre-training data, our largest model RevCol-H reaches 90.0% on ImageNet-1K, 63.8% APbox on the COCO detection minival set, and 61.0% mIoU on ADE20k segmentation. To our knowledge, these are the best COCO detection and ADE20k segmentation results among pure (static) CNN models. Moreover, as a general macro architecture, RevCol can also be introduced into transformers or other neural networks, which is demonstrated to improve performance on both computer vision and NLP tasks. We release code and models at https://github.com/megvii-research/RevCol
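To illustrate what "reversible" means here, the following is a minimal sketch of a generic additive coupling whose inputs can be recovered exactly from its outputs. It is not RevCol's actual multi-level connection; the functions `f` and `g` are hypothetical stand-ins for learned sub-blocks, and the two-way split of the features is an assumption made only for this example.

```python
import math

def f(x):
    # Hypothetical stand-in for a learned sub-block (assumption).
    return [math.tanh(v) for v in x]

def g(x):
    # Second hypothetical sub-block (assumption).
    return [0.5 * v * v for v in x]

def forward(x1, x2):
    # Additive coupling: each output adds a function of the other stream.
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2

def inverse(y1, y2):
    # Undo the two additions in reverse order to recover the inputs.
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2

x1, x2 = [0.3, -1.2], [2.0, 0.5]
r1, r2 = inverse(*forward(x1, x2))
```

Because the inverse only subtracts what the forward pass added, no information about the inputs is lost, regardless of what `f` and `g` compute.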
Translated by Google Translate
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, and algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants (32%) stated that they did not have enough time for it. 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
Whole-slide images (WSIs) in computational pathology have gigapixel resolution but generally contain only sparse regions of interest, which leads to weak diagnostic relevance and data inefficiency for each area in the slide. Most existing methods rely on a multiple instance learning framework that requires densely sampling local patches at high magnification. The limitation is evident in the application stage, as the heavy computation for extracting patch-level features is inevitable. In this paper, we develop RLogist, a benchmarking deep reinforcement learning (DRL) method for fast observation strategy on WSIs. Imitating the diagnostic logic of human pathologists, our RL agent learns how to find regions of observation value and obtain representative features across multiple resolution levels, without having to analyze each part of the WSI at high magnification. We benchmark our method on two whole-slide level classification tasks: detection of metastases in WSIs of lymph node sections, and subtyping of lung cancer. Experimental results demonstrate that RLogist achieves competitive classification performance compared to typical multiple instance learning algorithms, while having a significantly shorter observation path. In addition, the observation path given by RLogist provides good decision-making interpretability, and its ability to navigate a reading path can potentially be used by pathologists for educational or assistive purposes. Our code is available at: https://github.com/tencent-ailab/RLogist
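This study aims to make metaverse characters more realistic by adding lip animations learned from in-the-wild videos. To achieve this, our approach extends the Tacotron 2 text-to-speech synthesizer to generate lip movements together with the mel spectrogram in a single pass. The encoder and gate layer weights are pre-trained on the LJ Speech 1.1 dataset, while the decoder is retrained on 93 clips of TED talk videos extracted from the LRS 3 dataset. Our novel decoder predicts the displacement over time of 20 lip landmark positions, using labels automatically extracted by the OpenFace 2.0 landmark predictor. Training converges in 7 hours using less than 5 minutes of video. We conduct ablation studies on the pre/post-net and the pre-trained encoder weights to demonstrate the effectiveness of transfer learning between audio and visual speech data.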
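Fairness, a criterion that focuses on evaluating algorithm performance across different demographic groups, has gained attention in natural language processing, recommendation systems, and facial recognition. Since medical image samples carry many demographic attributes, it is important to understand the concepts of fairness, become familiar with unfairness mitigation techniques, evaluate the degree of fairness of algorithms, and recognize the challenges of fairness issues in medical image analysis (MedIA). In this paper, we first give a comprehensive and precise definition of fairness, followed by an introduction to the techniques currently used to address fairness issues in MedIA. After that, we list public medical image datasets that contain demographic attributes to facilitate fairness research, and summarize current algorithms concerning fairness in MedIA. To help achieve a better understanding of fairness and draw attention to fairness-related issues in MedIA, we conduct experiments that compare the difference between fairness and data imbalance, verify the existence of unfairness in various MedIA tasks (especially classification, segmentation, and detection), and evaluate the effectiveness of unfairness mitigation algorithms. Finally, we conclude with the opportunities and challenges of fairness in MedIA.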
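Evaluating performance across demographic groups can be sketched as follows. This is a minimal illustration under stated assumptions: the per-group accuracy gap is only one of many possible fairness measures and is not claimed to be the one used in the paper, and the function names, labels, and group identifiers are hypothetical.

```python
from collections import defaultdict

def group_accuracies(y_true, y_pred, groups):
    # Accuracy computed separately for each demographic group.
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}

def accuracy_gap(accuracies):
    # Difference between the best- and worst-served groups (0 means parity).
    return max(accuracies.values()) - min(accuracies.values())

# Toy data (assumption): two groups, three samples each.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1]
groups = ["a", "a", "a", "b", "b", "b"]
acc = group_accuracies(y_true, y_pred, groups)
gap = accuracy_gap(acc)
```

A gap near zero indicates similar accuracy across groups under this particular measure; a large gap flags a group on which the model performs noticeably worse.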
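In this paper, we propose projected gradient descent (PGD) algorithms for signal estimation from noisy nonlinear measurements. We assume that the unknown $p$-dimensional signal lies in the range of an $L$-Lipschitz continuous generative model with bounded $k$-dimensional inputs. In particular, we consider two cases: when the nonlinear link function is unknown, and when it is known. For unknown nonlinearity, similarly to \cite{liu2020循环}, we make the assumption of sub-Gaussian observations and propose a linear least-squares estimator. We show that when there is no representation error and the sensing vectors are Gaussian, roughly $O(k \log L)$ samples suffice to ensure that a PGD algorithm converges linearly, from arbitrary initialization, to a point achieving the optimal statistical rate. For known nonlinearity, we assume monotonicity as in \cite{yang2016sparse}, make much weaker assumptions on the sensing vectors, and allow for representation error. We propose a nonlinear least-squares estimator that is guaranteed to enjoy an optimal statistical rate. A corresponding PGD algorithm is provided and is shown to converge linearly to the estimator from arbitrary initialization. In addition, we present experimental results on image datasets to demonstrate the performance of our PGD algorithms.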
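The general shape of projected gradient descent over the range of a generative model can be sketched on a toy problem. Everything specific below is an assumption made for illustration: the generator is the one-dimensional linear map G(z) = z * w with |z| <= bound (so the projection has a closed form), the measurements are linear and noiseless rather than nonlinear and noisy, and the step size is fixed at 1. It is not the estimator analyzed in the paper.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_onto_range(x, w, bound):
    # Closest point to x in {z * w : |z| <= bound}, for unit-norm w.
    z = max(-bound, min(bound, dot(x, w)))
    return [z * wi for wi in w]

def pgd(A, y, w, bound, steps=100):
    m, p = len(A), len(w)
    x = [0.0] * p  # arbitrary initialization
    for _ in range(steps):
        # Gradient of the least-squares loss (1 / 2m) * ||A x - y||^2.
        residual = [dot(row, x) - yi for row, yi in zip(A, y)]
        grad = [sum(A[i][j] * residual[i] for i in range(m)) / m for j in range(p)]
        # Gradient step followed by projection onto the generator's range.
        x = project_onto_range([xj - gj for xj, gj in zip(x, grad)], w, bound)
    return x

rng = random.Random(0)
p, m = 10, 200
w = [1.0 / p ** 0.5] * p                    # unit-norm direction of the toy generator
x_star = [2.0 * wi for wi in w]             # true signal G(2.0)
A = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(m)]  # Gaussian sensing vectors
y = [dot(row, x_star) for row in A]         # noiseless linear measurements (simplification)
x_hat = pgd(A, y, w, bound=5.0)
error = max(abs(a - b) for a, b in zip(x_hat, x_star))
```

With a learned nonlinear generator the projection step has no closed form and is typically approximated by an inner optimization over the latent input; the alternation of gradient step and projection stays the same.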
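Recent years have witnessed a trend of applying context frames to boost the performance of object detection, known as video object detection. Existing methods usually aggregate features in a single stroke to enhance them. However, these methods usually lack spatial information from neighboring frames and suffer from insufficient feature aggregation. To address these issues, we introduce temporal and spatial information in a progressive way for an integrated enhancement. The temporal information is introduced by the Temporal Feature Aggregation Model (TFAM), which applies an attention mechanism between the context frames and the target frame (i.e., the frame to be detected). Meanwhile, we employ a Spatial Transition Awareness Model (STAM) to convey the location transition information between each context frame and the target frame. Built upon DETR, a transformer-based detector, our PTSEFormer also follows an end-to-end fashion to avoid heavy post-processing procedures, while achieving 88.1% mAP on the ImageNet VID dataset. Code is available at https://github.com/hon-wong/ptseformer.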
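Attention between a target frame and its context frames can be sketched with plain scaled dot-product attention. This is a generic illustration, not the TFAM module itself: the single feature vector per frame, the residual addition, and all names are assumptions.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(target, contexts):
    # The target-frame feature acts as the query; context-frame features
    # act as both keys and values.
    dim = len(target)
    scores = [sum(t * c for t, c in zip(target, ctx)) / math.sqrt(dim) for ctx in contexts]
    weights = softmax(scores)
    mixed = [sum(w * ctx[j] for w, ctx in zip(weights, contexts)) for j in range(dim)]
    # Residual addition keeps the original target feature (assumption).
    return [t + m for t, m in zip(target, mixed)], weights

target = [1.0, 0.0]
contexts = [[1.0, 0.0], [0.0, 1.0]]
enhanced, weights = aggregate(target, contexts)
```

The context frame whose feature is most similar to the target receives the largest weight, so it contributes most to the enhanced target feature.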
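Lifelong language learning aims to learn a stream of NLP tasks while retaining knowledge of previous tasks. Previous works based on language models and following the data-free constraint have explored formatting all data as "begin token (\textit{B}) + context (\textit{C}) + question (\textit{Q}) + answer (\textit{A})" for different tasks. However, they still suffer from catastrophic forgetting, which is exacerbated when the pseudo data for previous tasks is insufficient, for the following reasons: (1) the model has difficulty generating pseudo data that corresponds to the task, and (2) \textit{A} is prone to error when \textit{A} and \textit{C} are separated by \textit{Q}, because the information of \textit{C} is diminished before \textit{A} is generated. Therefore, we propose Ask Question First and Replay Question (AQF-RQ), which includes a novel data format "\textit{BQCA}" and a new training task that trains on pseudo questions of previous tasks. Experimental results demonstrate that AQF-RQ makes it easier for the model to generate more pseudo data that matches the corresponding tasks, and that it is more robust to both sufficient and insufficient pseudo data, whether the task boundary is clear or unclear. AQF-RQ achieves performance only 0.36\% lower than multi-task learning.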
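The difference between the two data formats is only the order of the fields, which can be sketched as follows. The `[BEGIN]` token string, the space separator, and the function names are hypothetical; the abstract does not specify the actual tokens used.

```python
def format_bcqa(context, question, answer, begin="[BEGIN]"):
    # Earlier format: the question sits between the context and the answer.
    return f"{begin} {context} {question} {answer}"

def format_bqca(context, question, answer, begin="[BEGIN]"):
    # Question-first format: the context directly precedes the answer.
    return f"{begin} {question} {context} {answer}"

sample = {
    "context": "The cat sat on the mat.",
    "question": "Where did the cat sit?",
    "answer": "On the mat.",
}
old_style = format_bcqa(**sample)
new_style = format_bqca(**sample)
```

In the question-first ordering, nothing separates the context from the answer, which is the property the abstract credits with reducing answer errors.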
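Amphibious ground-aerial vehicles fuse flying and driving modes to enable more flexible air-land mobility, and have recently received growing attention. By analyzing existing amphibious vehicles, we highlight the autonomous fly-driving functionality needed for the effective use of amphibious vehicles in complex three-dimensional urban transportation systems. We review and summarize the key enabling technologies for intelligent flying-driving in existing amphibious vehicle designs, identify the major technological barriers, and propose potential solutions for future research and innovation. This paper aims to serve as a guide for the research and development of intelligent amphibious vehicles for future urban transportation.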
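Given an image and a reference caption, the image caption editing task aims to correct misalignment errors and generate a refined caption. However, all existing caption editing works are implicit models, i.e., they directly produce the refined caption without an explicit connection to the reference caption. In this paper, we introduce a new task: Explicit Caption Editing (ECE). ECE models explicitly generate a sequence of edit operations, and this edit operation sequence can transform the reference caption into a refined one. Compared to implicit editing, ECE has multiple advantages: 1) Explainable: it can trace the whole editing path. 2) Editing efficient: it only needs to modify a few words. 3) Human-like: it resembles the way humans perform caption editing and tries to keep the original sentence structure. To solve this new task, we propose the first ECE model: TIGER. TIGER is a non-autoregressive transformer-based model consisting of three modules: Tagger_del, Tagger_add, and Inserter. Specifically, Tagger_del decides whether each word should be preserved, Tagger_add decides where to add new words, and Inserter predicts the specific words to add. To further facilitate ECE research, we propose two new ECE benchmarks by re-organizing two existing datasets, dubbed COCO-EE and Flickr30K-EE, respectively. Extensive ablations on both benchmarks demonstrate the effectiveness of TIGER.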
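How a sequence of explicit edit operations turns a reference caption into a refined one can be sketched as follows. This only illustrates applying the three kinds of decisions described above; the flag encoding (one keep flag and one add-after flag per reference word), the function name, and the example caption are assumptions, and no model predictions are involved.

```python
def apply_edits(words, keep, add_after, new_words):
    # keep[i]:      whether reference word i is preserved (a Tagger_del-style decision).
    # add_after[i]: whether a new word is inserted after position i (Tagger_add-style).
    # new_words:    the words to insert, in order (Inserter-style predictions).
    pending = iter(new_words)
    refined = []
    for word, keep_word, add_word in zip(words, keep, add_after):
        if keep_word:
            refined.append(word)
        if add_word:
            refined.append(next(pending))
    return refined

reference = ["a", "dog", "sitting", "on", "a", "bench"]
keep = [True, False, True, True, True, False]
add_after = [False, True, False, False, False, True]
refined = apply_edits(reference, keep, add_after, ["cat", "sofa"])
```

Only two of the six reference words change, and the flags record exactly which positions were deleted or extended, which is what makes the editing path traceable.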